Abstract

The problem of assigning centrality values to nodes and edges in graphs has been widely investigated in
recent years. Recently, a novel measure of node centrality has been proposed, called the κ-path centrality
index, which is based on the propagation of messages inside a network along paths consisting of at most κ
edges. On the other hand, the importance of computing the centrality of edges has been highlighted since
the 1970s by Anthonisse and, subsequently, by Girvan and Newman. In this work we generalize the concept
of κ-path centrality by defining the κ-path edge centrality, a measure introduced to compute the
importance of edges. We provide an efficient algorithm running in O(κm), where m is the number of edges
in the graph. Thus, our technique is feasible for large-scale network analysis. Finally, the performance
of our algorithm is analyzed, discussing the results obtained on large online social network datasets.
1. Introduction

In the context of social knowledge management, Social Network Analysis (SNA) has been attracting
increasing attention from the scientific community, in particular in recent years. One of the main
motivations is the unprecedented success of phenomena such as online social networks and online
communities. In this landscape, not only from a scientific perspective but also for commercial or
strategic reasons, the identification of the principal actors inside a network is very important.

Such an identification requires defining an importance measure (also referred to as centrality) to weight
nodes and/or edges.

The simplest approaches to computing centrality consider only the local topological properties of a node/edge
in the social network graph: for instance, the most intuitive node centrality measure is the degree of a
node, i.e., the number of social contacts of a user. Unfortunately, local measures of centrality, whose
estimation is computationally feasible even on large networks, do not produce very faithful results [1].

For these reasons, many authors suggested considering the whole social network topology to compute
centrality values. A new family of centrality measures was born, called global measures. Some examples of
global centrality measures are closeness [2] and betweenness centrality (for nodes [3], and edges [HIS]).

Betweenness centrality is one of the most popular measures and its computation is the core component of
a range of algorithms and applications. Betweenness centrality relies on the idea that, in social networks,
information flows along shortest paths: as a consequence, a node/edge has a high betweenness centrality if
a large number of shortest paths cross it.

Some authors, however, raised concerns about the effectiveness of betweenness centrality. First of all,
computing the exact value of betweenness centrality for each node/edge of a given graph is
computationally demanding, or even infeasible, as the size of the analyzed network grows. Therefore,
the need arises for fast, even if approximate, techniques to compute betweenness centrality, and this is
currently a relevant research topic in Social Network Analysis.

A further issue is that the assumption that information in social networks propagates only along shortest
paths may not hold [6]. By contrast, information propagation models have been proposed in which
information, encoded as messages generated in a source node and directed toward a target node in the
network, may flow along arbitrary paths. In the spirit of such a model, some authors [7, 8] suggested
performing random walks on the social network to compute centrality values.

A prominent approach following this research line is the work proposed in [9]. In that work, the authors
introduced a novel node centrality measure known as κ-path centrality. In detail, the authors suggested
using self-avoiding random walks [10] of length κ (κ being a suitable integer) to compute centrality values.
They provided an approximate algorithm, running in $O(\kappa^3 n^{2-2\alpha} \log n)$, where n is the number
of nodes and α is a parameter of the approximation.

In this paper we extend that work [9] by introducing a measure of edge centrality, called κ-path edge
centrality. In our approach, the procedure of computing edge centrality is viewed as an information
propagation problem. In detail, if we assume that multiple messages are generated and propagated within a
social network, an edge is considered "central" if it is frequently exploited to diffuse information.

Relying on this idea, we simulate message propagation through random walks on the social network graph.
In our simulation, in addition, we assume that random walks are simple and of bounded length, up to a
constant, user-defined value κ. The former assumption holds because a random walk should be forced to pass
no more than once through an edge; the latter because, as in prior work, we assume that the more distant
two nodes are, the less they influence each other.

The computation of edge centrality has many practical applications in a wide range of contexts and, in
particular, in the area of Knowledge-Based (KB) systems. For instance, in KB systems in which data can
be conveniently managed through graphs, the procedure of weighting edges plays a key role in identifying
communities, i.e., groups of nodes densely connected to each other and weakly coupled with nodes residing
outside the community itself [T^l [13]. This is useful to better organize available knowledge: think, for
instance, of an e-commerce platform, where we could partition customer communities into smaller
groups and selectively forward messages (like commercial advertisements) only to groups whose
members are actually interested in them. In addition, in the context of the Semantic Web, edge centralities
are useful to quantify the strength of the relationships linking two objects and, therefore, can be useful
to discover new knowledge [T3]. Finally, in the context of social networks, edge centralities are helpful to
model the intensity of the social tie between two individuals [15]: in such a case, we could extract patterns
of interactions among users in virtual communities and analyze them to understand how a user is able to
influence another one.

The main contributions of this paper are the following:

• We propose an approach based on random walks consisting of up to κ edges to compute edge centrality.
In detail, we observe that many approaches have been proposed in the literature to compute node
centrality but, comparatively, there are few studies on edge centrality computation (among them we
cite the edge betweenness centrality introduced in the Girvan-Newman algorithm [5]). In addition,
some authors successfully applied random walks to compute node centrality in networks. We
suggest extending these ideas in the direction of edge centrality and, therefore, this work is the first
attempt to compute edge centrality by means of random walks.

• We design an algorithm to efficiently compute edge centrality. The worst-case time complexity of our
algorithm is O(κm), where m is the number of edges in the social network graph and κ is a constant (and
typically small) factor. Therefore, the running time of our algorithm scales linearly with the
number of edges of a social network. This is an interesting improvement over the state of the art: in fact,
exact algorithms for computing centrality run in O(n^3) and, with some ingenious optimizations, they
can run in O(nm) [TTllSj. Unfortunately, real-life social networks consist of up to millions of nodes/edges
[T5] and, therefore, these approaches may not scale well. By contrast, our algorithm works fairly well
also on large real-life social networks, even in the presence of limited computing resources.

• We report the results of our experiments, showing that our approach is able to generate
reproducible results even though it relies on random walks. Several experiments have been carried out
in order to emphasize that the κ-path edge centrality computation is feasible even on large social
networks. Finally, the properties shown by this measure are discussed, in order to characterize each
of the studied networks.

The paper is organized as follows: in Section 2 we provide some background information on the problems
related to centrality measures. Section 3 presents the goal of this paper and our κ-path edge centrality,
including the fast algorithm for its computation. The experimental evaluation of the performance of this
strategy is discussed in Section 4 and some possible applications of our approach are presented in
Section 5. Finally, the paper concludes in Section 6.

2. Background about Centrality Measures and Applications 

In this section we review the concept of centrality measure and illustrate some recent approaches to compute 
it. 

2.1. Centrality Measure in Social Networks 

One of the first (and the most popular) node centrality measures is the betweenness centrality [3]. It is 
defined as follows: 

Definition 1. (Betweenness centrality) Given a graph G = (V, E), the betweenness centrality of the node
v ∈ V is defined as

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \qquad (1)$$

where s and t are nodes in V, $\sigma_{st}$ is the number of shortest paths connecting s to t, and
$\sigma_{st}(v)$ is the number of shortest paths connecting s to t passing through the node v.

If there is no path joining s and t we conventionally set $\frac{\sigma_{st}(v)}{\sigma_{st}} = 0$.

The concept of centrality has also been defined for the edges of a graph and, from a historical standpoint,
the first approach to compute edge centrality was proposed in 1971 by J.M. Anthonisse [HHl] and was
implemented in the GRADAP software package. In this approach, edge centrality is interpreted as a "flow
centrality" measure. To define it, let us consider a graph G = (V, E) and let s ∈ V, t ∈ V be a fixed pair of
nodes. Assume that a "unit of flow" is injected in the network by picking s as the source node and assume
that this unit flows in G along the shortest paths. The rush index associated with the pair (s, t) and the
edge e ∈ E is defined as

$$\delta_{st}(e) = \frac{\sigma_{st}(e)}{\sigma_{st}}$$

where, as before, $\sigma_{st}$ is the number of shortest paths connecting s to t, and $\sigma_{st}(e)$ is the
number of shortest paths connecting s to t passing through the edge e. As in the previous case, we
conventionally set $\delta_{st}(e) = 0$ if there is no path joining s and t.

The rush index of an edge e ranges from 0 (if e does not belong to any shortest path joining s and t) to
1 (if e belongs to all the shortest paths joining s and t). Therefore, the higher $\delta_{st}(e)$, the more
relevant the contribution of e in the transfer of a unit of flow from s to t. The centrality of e can be
defined by considering all the pairs (s, t) of nodes and by computing, for each pair, the rush index
$\delta_{st}(e)$; the centrality $C_{RI}(e)$ of e is the sum of all these contributions

$$C_{RI}(e) = \sum_{s \in V} \sum_{t \in V} \delta_{st}(e)$$

More recently, in 2002, Girvan and Newman proposed in [5] a definition of edge betweenness centrality which
strongly resembles that provided by Anthonisse.

According to the notation introduced above, the edge betweenness centrality of the edge e ∈ E is defined
as

$$C_{EB}(e) = \sum_{s \in V} \sum_{t \in V \setminus \{s\}} \frac{\sigma_{st}(e)}{\sigma_{st}} \qquad (2)$$

and it differs from that of Anthonisse because the source node s and the target node t must be different.
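To make the two definitions concrete, the following minimal Python sketch sums the rush index over all ordered pairs with s ≠ t, as in Equation (2), on a small undirected, unweighted graph. The toy graph `EDGES`, the function names and the brute-force per-pair counting are assumptions made for illustration only; this is not the procedure used in the cited works, and it is far slower than the algorithms discussed in Section 2.2.

```python
from collections import deque
from fractions import Fraction

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]

def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_counts(adj, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def edge_betweenness(edges):
    """Sum sigma_st(e) / sigma_st over all ordered pairs (s, t) with s != t."""
    adj = adjacency(edges)
    info = {v: bfs_counts(adj, v) for v in adj}
    score = {frozenset(e): Fraction(0) for e in edges}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in adj:
            if t == s or t not in dist_s:
                continue  # s must differ from t; unreachable pairs contribute 0
            dist_t, sigma_t = info[t]
            for u, v in edges:
                for x, y in ((u, v), (v, u)):
                    # x -> y lies on a shortest s-t path iff the distances add up
                    if x in dist_s and y in dist_t and dist_s[x] + 1 + dist_t[y] == dist_s[t]:
                        paths_through = sigma_s[x] * sigma_t[y]
                        score[frozenset((u, v))] += Fraction(paths_through, sigma_s[t])
    return score
```

On this toy graph the edge (c, d) collects the largest score, because every shortest path reaching the pendant node d must cross it.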

Other, marginally different, definitions of betweenness centrality have been proposed by [20], such as
bounded-distance, distance-scaled, edge and group betweenness, and stress and load centrality.

Although the appropriateness of betweenness centrality in representing the "importance" of a
node/edge inside the network is evident, its adoption is not always the only solution to a given problem.
For example, as already highlighted by [6], the first limitation of the concept of betweenness centrality is
that influence or information does not propagate following only shortest paths. With
regard to influence propagation, it is also evident that the more distant two nodes are, the less they
influence each other, as stated by [11]. Additionally, in real applications (such as those described in
Section 2.3) it is usually not required to calculate the exact ranking, with respect to betweenness
centrality, of each node/edge inside the network. In fact, it is more useful to identify the top arbitrary
percentage of nodes/edges which are most relevant to the given specific problem (e.g., study of the
propagation of information, identification of key actors, etc.).



2.2. Recent Approaches for Computing Betweenness Centrality 

To date, several algorithms to compute the betweenness centrality (of nodes) in a graph have been
presented. The most efficient has been proposed by [17], which runs in O(nm) for unweighted graphs, and
in $O(nm + n^2 \log n)$ for weighted graphs, containing n nodes and m edges.

The computational complexity of these approaches makes them infeasible for large network analysis. For this
reason, different approximate solutions have been proposed. Among others, [21] developed a randomized
algorithm (namely, "RA-Brandes") and, similarly, by using adaptive techniques, [32] proposed another
approximate version (called "AS-Bader"). In [7], Newman devised a random-walk based algorithm to compute
betweenness centrality which shares similarities with our approach, starting from the concept of message
propagation along random paths. From the same concept, [9] proposed the κ-path centrality measure (for
nodes) and developed an $O(\kappa^3 n^{2-2\alpha} \log n)$ algorithm (namely, "RA-κpath") to compute it.

2.3. Application of Centrality Measures in Social Network Analysis

Applications of centrality information acquired from social networks have been investigated by [23]. The
authors defined different methodologies to exploit discovered data, e.g., for marketing purposes,
recommendation and trust analysis.

Several marketing and commercial studies have been applied to online social networks (OSNs), in particular
to discover efficient channels to distribute information [211 US] and to study the spread of influence [55].
Potentially, our study could provide useful information to all these applied research directions, identifying
those interesting edges with high κ-path edge centrality, which emphasizes their importance within the
social network. The nodes interconnected by highly central edges are important because of the position
they "topologically" occupy. Moreover, they could efficiently carry information to their neighborhood.

3. Measuring Edge Centrality 

3.1. Design Goals 

Before providing a formal description of our algorithm, we illustrate the main ideas behind it. We start
from a real-life example and we use it to derive some "requirements" our algorithm should satisfy.

Let us consider a network of devices. In this context, without loss of generality, we can assume that the 
simplest "piece" of information is a message. In addition, each device has an address book storing the devices 
with which it can exchange messages. A device can both receive and transmit messages to other devices 
appearing in its address book. 

The purpose of our algorithm is to rank the links of the network on the basis of their aptitude to favor the
diffusion of information. In detail, the higher the rank of a link, the higher its ability to propagate a
message. Henceforth, we refer to this problem as link ranking.

The link ranking problem in our scenario can be viewed as the problem of computing edge centrality in
social networks. We conjecture that some of the hypotheses/procedures adopted to compute edge centrality can
be applied to solve the link ranking problem. We suggest extending these techniques in a number of ways.
In detail, we believe that the algorithm to compute the link ranking should satisfy the following requirements:

Requirement 1 - Simulation of Message Propagation by using Random Walks. As shown in Section 2, some
authors assume that information flows on a network along the shortest paths. Such an intuition is formally
captured by Equation (1). However, as observed in [27, 7], centrality measures based on shortest paths
can provide some counterintuitive results. In detail, [27, 7] present some simple examples showing that the
application of Equation (1) would lead to assigning excessively low centrality scores to some nodes.

For this reason, some authors [27] provided a more refined definition of centrality relying on the concept
of flow in a graph. To define this measure, assume that each edge in the network can carry one or more
messages; we are interested in finding those edges capable of transferring the largest amount of messages
between a source node s and a target node t. The centrality of a vertex v can be computed by considering
all the pairs (s, t) of nodes and, for each pair, by computing the amount of flow passing through v. In the
light of such a definition, non-shortest paths are also considered in the computation of node centrality.

However, in [7], Newman shows that centrality measures based on the concept of flow are not exempt from
odd effects. For this reason, the author suggests considering a random walker which is not forced to move
along the shortest paths of a network to compute the centrality of nodes.

Newman's strategy has been designed to compute node centrality, whereas our approach aims at
computing edge centrality. Despite this difference, we believe that the idea of using random walks in place
of shortest paths can be successful even when applied to the link ranking problem.

In our scenario, if a device wants to propagate a message, it is generally not aware of the whole network
topology, and therefore it is not aware of the shortest paths to route the message. In fact, each device is
only aware of the devices appearing in its address book. As a consequence, the device selects, according to
its own criteria, one (or more) of its contacts and sends them the message in the hope that they will further
continue the propagation. In order to simulate the message propagation, our first requirement is to exploit
random walks.

Requirement 2 - Dynamic Update of Ranking. Ideally, if we simulated the propagation of multiple
messages on our network of devices, it could happen that an edge is selected more frequently than others.
Edges appearing more frequently than others show a better aptitude to spread messages and, therefore, their
rank should be higher than that of the others. As a consequence, our mechanism to rank edges should be
dynamic: at the beginning, all the edges are equally likely to propagate a message and, therefore, they have
the same rank. At each step of the simulation, if an edge is selected, it must be rewarded with a "bonus score".

Requirement 3 - Simple Paths. The procedure of simulating message propagation through random walks 
described above could imply that a message can pass through an edge more than once. In such a case, 
the rank of edges which are traversed multiple times would be disproportionately inflated whereas the rank 
of edges rarely (or never) visited could be underestimated. The global effect would be that the ranking 
produced by this approach would not be correct. As a consequence, another requirement is that the paths 
exploited by our algorithm must be simple. 

Requirement 4 - Bounded Length Paths. As shown in [11], the more distant two nodes are, the less they
influence each other. The usage of paths of bounded length has already been explored to compute node
centrality |2H1 HI]. A first relevant example is provided in [^Hl; in that paper the authors observe that
methods to compute node centralities like those based on eigenvectors can lead to counterintuitive results.
In fact, those methods take the whole network topology into account and, therefore, they compute the
centrality of a node on a global scale. It may happen that a node has a big impact on a small
scale (think of a well-respected researcher working on a niche topic) but a limited visibility on a large scale.
Therefore, the authors suggested computing node centralities in local networks and they considered
ego networks. An ego network is defined as a network consisting of a single node (ego) together with the
nodes it is connected to (the alters) and all the links among those alters. The diameter of an ego network is
2 and, therefore, the computation of node centrality in such a network requires computing paths up to
length 2. In [28] the authors extended these concepts by considering paths up to length κ.

We agree with the observations above and assume that two nodes are considered distant if the shortest
path connecting them is longer than κ hops, κ being the established threshold. Such a consideration regards
as effective only those paths whose length is up to κ. We adopt this requirement and, in our simulation
procedure, we consider paths of bounded length.

In the next sections we shall discuss how our algorithm is able to incorporate the requirements illustrated 
above. 

3.2. κ-Path Centrality

In this section we introduce the concepts of κ-path node centrality and κ-path edge centrality.
The notion of κ-path node centrality, introduced by [9], is defined as follows:

Definition 2. (κ-path node centrality) For each node v of a graph G = (V, E), the κ-path node centrality
$C^{\kappa}(v)$ of v is defined as the sum, over all possible source nodes s, of the frequency with which a
message originated from s goes through v, assuming that the message traversals are only along random simple
paths of at most κ edges.

It can be formalized, for an arbitrary node v ∈ V, as

$$C^{\kappa}(v) = \sum_{s \in V} \frac{\sigma^{\kappa}_s(v)}{\sigma^{\kappa}_s} \qquad (3)$$

where s ranges over all the possible source nodes, $\sigma^{\kappa}_s(v)$ is the number of κ-paths originating
from s and passing through v, and $\sigma^{\kappa}_s$ is the overall number of κ-paths originating from s.

Observe that Equation (3) resembles the definition of betweenness centrality provided in Equation (1). In
fact, the structure of the two equations coincides if we replace the concept of shortest paths (adopted in the
betweenness centrality) with the concept of κ-paths, which is the core of our definition of κ-path centrality.

The possibility of extending the concept of "centrality" from nodes to edges has already been exploited
by Girvan and Newman [5]. In particular, they generalized the formulation of "betweenness centrality"
(referred to nodes), introducing the concept of "edge betweenness centrality".

Similarly, we extend Definition 2 in order to define an edge centrality index, named κ-path edge centrality.

Definition 3. (κ-path edge centrality) For each edge e of a graph G = (V, E), the κ-path edge centrality
$L^{\kappa}(e)$ of e is defined as the sum, over all possible source nodes s, of the frequency with which a
message originated from s traverses e, assuming that the message traversals are only along random simple
paths of at most κ edges.

The κ-path edge centrality is formalized, for an arbitrary edge e, as follows

$$L^{\kappa}(e) = \sum_{s \in V} \frac{\sigma^{\kappa}_s(e)}{\sigma^{\kappa}_s} \qquad (4)$$

where s ranges over all the possible source nodes, $\sigma^{\kappa}_s(e)$ is the number of κ-paths originating
from s and traversing the edge e and, finally, $\sigma^{\kappa}_s$ is the number of κ-paths originating from s.

In practical cases, the application of Equation (4) may not be feasible because it requires counting all the
κ-paths originating from all the source nodes s, and such a number can be exponential in the number of
nodes of G. For this reason, we need to design algorithms capable of efficiently approximating the
value of κ-path edge centrality. These algorithms will be introduced and discussed in the next subsections.
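To illustrate why exact counting becomes impractical, the following Python sketch evaluates the sum in Definition 3 by brute-force enumeration. It rests on assumptions that the text does not spell out: a κ-path from s is taken to be any walk that starts at s, uses between 1 and κ edges, and never reuses an edge; the graph is undirected and given as an edge list. The function name is hypothetical, and the enumeration is exponential, so it is only usable on tiny graphs.

```python
from fractions import Fraction

def exact_kpath_edge_centrality(edges, kappa):
    """Brute-force the sum over sources of (paths through e) / (paths from s)."""
    # incident[v] lists (edge index, neighbour) pairs for every node v
    incident = {}
    for idx, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append((idx, v))
        incident.setdefault(v, []).append((idx, u))

    centrality = [Fraction(0)] * len(edges)
    for s in incident:
        through = [0] * len(edges)  # number of paths from s that use each edge
        total = 0                   # number of paths from s

        def extend(node, used):
            nonlocal total
            if len(used) == kappa:
                return  # the length bound has been reached
            for idx, nxt in incident[node]:
                if idx in used:
                    continue  # an edge may be traversed at most once
                path = used + [idx]
                total += 1
                for e in path:
                    through[e] += 1
                extend(nxt, path)

        extend(s, [])
        if total:
            for e in range(len(edges)):
                centrality[e] += Fraction(through[e], total)
    return centrality
```

For the path graph a-b-c with κ = 2, both edges obtain the same value under these assumptions, as expected by symmetry.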

3.3. The Algorithm for Computing the κ-Path Edge Centrality

In this section we discuss an algorithm, called Edge Random Walk κ-Path Centrality (or, shortly,
ERW-Kpath), to efficiently compute edge centrality values.

It consists of two main steps: (i) node and edge weights assignment and (ii) simulation of message
propagations through random simple paths. In the ERW-Kpath algorithm, the probabilities of selecting a node
or an edge are uniform; we also provide another version of the ERW-Kpath algorithm (called WERW-Kpath,
Weighted Edge Random Walk κ-Path Centrality) in which the node/edge probabilities are not uniform.

We will show in the Appendix that the ERW-Kpath and the WERW-Kpath algorithms return, as output, an
approximate value of the edge centrality index provided in Definition 3, and we will provide a quantitative
assessment of such an approximation.

In the following we shall discuss the ERW-Kpath algorithm by illustrating each of the two steps composing it.
After that, we will introduce the WERW-Kpath algorithm as a generalization of the ERW-Kpath algorithm.




3.3.1. Step 1: node and edge weights assignment



In the first stage of our algorithm, we assign a weight to both nodes and edges of the graph G = (V, E)
representing our social network. Weights on nodes are used to select the source nodes from which each
message propagation simulation starts. Weights on edges represent initial values of edge centrality and, to
comply with Requirement 2, they will be updated during the execution of our algorithm.

To compute weights on nodes, we introduce the normalized degree $\delta(v_n)$ of a node $v_n \in V$ as follows:

Definition 4. (Normalized degree) Given an undirected graph G = (V, E) and a node $v_n \in V$, its normalized
degree $\delta(v_n)$ is

$$\delta(v_n) = \frac{|I(v_n)|}{|V|} \qquad (5)$$

where $I(v_n)$ represents the set of edges incident on $v_n$.

The normalized degree $\delta(v_n)$ relates the degree of $v_n$ to the total number of nodes in the network.
Intuitively, it represents how much a node contributes to the overall connectivity of the graph. Its value
belongs to the interval [0, 1] and the higher $\delta(v_n)$, the better $v_n$ is connected in the graph.

Regarding edge weights, we introduce the following definition:

Definition 5. (Initial edge weight) Given an undirected graph G = (V, E) and an edge $e_m \in E$, its initial
edge weight $\omega_0(e_m)$ is

$$\omega_0(e_m) = \frac{1}{|E|} \qquad (6)$$

Intuitively, the meaning of Equation (6) is as follows: we initially manage a "budget" consisting of |E|
points; these points are equally divided among all the possible edges; the amount of points received by an
edge represents its initial rank.
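The weight assignment of Step 1 can be sketched in a few lines of Python. The function names and the edge-list representation are assumptions made for illustration, and the graph is assumed to be simple (no self-loops or parallel edges); exact fractions are used so that the values can be compared with Definitions 4 and 5 directly.

```python
from fractions import Fraction

def normalized_degrees(nodes, edges):
    """Normalized degree of every node: number of incident edges over |V|."""
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {v: Fraction(d, len(nodes)) for v, d in degree.items()}

def initial_edge_weights(edges):
    """Initial weight of every edge: 1 / |E|."""
    return {edge: Fraction(1, len(edges)) for edge in edges}
```

For example, on a graph with four nodes and the edges (a, b), (b, c), (c, a), (c, d), node c has normalized degree 3/4 and every edge starts with weight 1/4.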

In Figure 1 we report an example of graph G along with the distribution of weights on nodes and edges.

Figure 1: Example of assignment of normalized degrees and initial edge weights.



3.3.2. Step 2: Simulation of message propagations through random simple κ-paths

In the second step we simulate multiple random walks on the graph G; this is consistent with Requirement 1.

To this purpose, our algorithm iterates the following sub-steps a number of times equal to a fixed value ρ.
We will later provide a practical rule for tuning ρ. At each iteration, our algorithm performs
the following operations:

1. A node $v_n \in V$ is selected according to one of the following two possible strategies:

a. uniformly at random, with a probability

$$P(v_n) = \frac{1}{|V|} \qquad (7)$$

b. with a probability proportional to its normalized degree $\delta(v_n)$, given by

$$P(v_n) = \frac{\delta(v_n)}{\sum_{v_k \in V} \delta(v_k)} \qquad (8)$$

2. All the edges in G are marked as not traversed.

3. The procedure MessagePropagation is invoked. It generates a simple random walk whose length is not
greater than κ, satisfying Requirement 3.
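The two source-selection strategies of sub-step 1 can be sketched with Python's standard `random` module. The function name `pick_source` and its parameters are hypothetical; `delta` is assumed to map each node to its normalized degree, and `random.choices` is relied upon to normalize the weights, which matches drawing a node with probability proportional to its normalized degree.

```python
import random

def pick_source(nodes, delta=None, rng=random):
    """Pick the source node of one simulated message propagation."""
    nodes = list(nodes)
    if delta is None:
        # strategy (a): every node is equally likely
        return rng.choice(nodes)
    # strategy (b): probability proportional to the normalized degree
    weights = [delta[v] for v in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes a simulation reproducible.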

Let us describe the procedure MessagePropagation. This procedure carries out a loop as long as both the
following conditions hold true:

• The length of the path currently generated is no greater than κ. This is managed through a length
counter N.

• Assuming that the walk has reached the node $v_n$, there must exist at least one edge incident on $v_n$
which has not already been traversed. To check this, we attach a flag $T(e_m)$ to each edge $e_m \in E$, such
that

$$T(e_m) = \begin{cases} 1 & \text{if } e_m \text{ has already been traversed} \\ 0 & \text{otherwise} \end{cases}$$

We observe that the following condition must be true

$$|I(v_n)| > \sum_{e_k \in I(v_n)} T(e_k) \qquad (9)$$

$I(v_n)$ being the set of edges incident on $v_n$.

The former condition complies with Requirement 4 (i.e., it allows us to consider only paths up to length
κ). The latter condition, instead, prevents the message from passing more than once through an edge, thus
satisfying Requirement 3.

If the conditions above are satisfied, the MessagePropagation procedure selects an edge $e_m$ by applying one
of two strategies:

a. uniformly at random, with a probability

$$P(e_m) = \frac{1}{|I(v_n)| - \sum_{e_k \in I(v_n)} T(e_k)} \qquad (10)$$

among all the edges $e_m \in \{I(v_n) \mid T(e_m) = 0\}$ incident on $v_n$ (i.e., excluding already traversed edges);

b. with a probability proportional to the edge weight $\omega_i(e_m)$, given by

$$P(e_m) = \frac{\omega_i(e_m)}{\sum_{e_k \in \hat{I}(v_n)} \omega_i(e_k)} \qquad (11)$$

where $\hat{I}(v_n) = \{e_k \in I(v_n) \mid T(e_k) = 0\}$ and $\omega_i(e_m) = \omega_{i-1}(e_m) + \beta \cdot T(e_m)$ for $1 \le i \le \rho$.

Let $e_m$ be the selected edge and let $v_{n+1}$ be the node reached from $v_n$ by means of $e_m$. The
MessagePropagation procedure awards a bonus β to $e_m$, sets $T(e_m) = 1$ and increases the counter N by 1.
The message propagation activity continues from $v_{n+1}$.

At the end, each edge e ∈ E is assigned a centrality index $L^{\kappa}(e)$ equal to its final weight $\omega_\rho(e)$.
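A single edge-selection step can be sketched as follows. The helper `pick_edge`, the `incident` map from a node to the identifiers of its incident edges, and the `traversed` set standing in for the flags T are all assumptions made for illustration; the weighted branch draws among the untraversed edges with probability proportional to their current weight.

```python
import random

def pick_edge(node, incident, traversed, weight=None, rng=random):
    """Return an untraversed edge incident on node, or None if none is left."""
    candidates = [e for e in incident[node] if e not in traversed]
    if not candidates:
        return None  # the walk cannot be extended from this node
    if weight is None:
        # strategy (a): uniform choice among the untraversed edges
        return rng.choice(candidates)
    # strategy (b): choice proportional to the current edge weights
    current = [weight[e] for e in candidates]
    return rng.choices(candidates, weights=current, k=1)[0]
```

After a successful call, the caller would add the bonus to the returned edge, record it in `traversed`, and continue the walk from the node at its other end.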

The values of β and ρ, in principle, can be fixed in an arbitrary fashion, but we provide a simple practical
rule to tune them. From Theorem 6.2, reported in the Appendix, it emerges that in ERW-Kpath it is
convenient to set ρ ≈ |E|. In particular, if we set ρ = |E| - 1 and β = 1/|E| we get a nice result: the edge
centrality indexes always range in [1/|E|, 1] and, ideally, the centrality index of a given edge will be equal to
1 if (and only if) it is always selected in any message propagation simulation. In fact, each edge initially
receives a default score equal to 1/|E| and, if that edge is selected in a subsequent trial, it will increase its
score by β = 1/|E|. Intuitively, if an edge is selected in all the trials, its final score will be equal to

$$\frac{1}{|E|} + \rho \cdot \frac{1}{|E|} = \frac{1}{|E|} + \frac{|E| - 1}{|E|} = 1$$
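The range of the scores under this tuning rule can be checked with exact arithmetic. The short Python sketch below is only a worked instance of the reasoning above; the function name is hypothetical, and it assumes that each simulated walk traverses a given edge at most once, as imposed by Requirement 3.

```python
from fractions import Fraction

def score_range(num_edges):
    """Lowest and highest final score when rho = |E| - 1 and beta = 1 / |E|."""
    beta = Fraction(1, num_edges)    # bonus awarded per selection
    rho = num_edges - 1              # number of simulated walks
    lowest = Fraction(1, num_edges)  # the edge is never selected
    highest = lowest + rho * beta    # the edge is selected in every walk
    return lowest, highest
```

With 12 edges, for instance, the scores lie between 1/12 and 1.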

The time complexity of this algorithm is O(κρ). If we fix ρ = |E| - 1, we achieve a good trade-off between
accuracy and computational costs. In fact, in such a case, the worst-case time complexity of the
ERW-Kpath algorithm is O(κ|E|) and, since in real social networks |E| is of the same order of magnitude as |V|,
the time complexity of our approach is near linear in the number of nodes. This makes our approach
computationally feasible also for large real-life social networks.

The version of the algorithm shown in Algorithms 1 and 2 adopts uniform probability distribution functions
in order to choose nodes and edges purely at random and, as said before, it is called ERW-Kpath.

A weighted version of the same algorithm, called WERW-Kpath, would differ only in line 5 (Algorithm
1) and line 2 (Algorithm 2), adopting the weighted functions specified in Equations (8) and (11). During our
experimentation we always adopted the WERW-Kpath algorithm, for the motivations explained in section


Algorithm 1 ERW-Kpath(Graph G = (V, E), int k, int ρ, float β)
1: Assign each node v_n ∈ V its normalized degree
2: Assign each edge e_m ∈ E the uniform probability function as weight
3: for i = 1 to ρ do
4:     N ← 0, a counter to check the length of the K-path
5:     v_n ← a node chosen uniformly at random in V
6:     MessagePropagation(v_n, N, k, β)
7: end for

Algorithm 2 MessagePropagation(Node v_n, int N, int k, float β)
1: while N < k and |I(v_n)| > Σ_{e ∈ I(v_n)} T(e) do
2:     e_m ← an edge in {e_m ∈ I(v_n) | T(e_m) = 0}, chosen uniformly at random
3:     Let v_{n+1} be the node reached by v_n through e_m
4:     ω(e_m) ← ω(e_m) + β
5:     T(e_m) ← 1
6:     v_n ← v_{n+1}
7:     N ← N + 1
8: end while
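
To make the interplay between Algorithms 1 and 2 concrete, the following Python sketch simulates ERW-Kpath on a small undirected edge list. It is only an illustration under stated assumptions, not the Java implementation used in the experiments: the function name erw_kpath, the edge-list input format and the seed parameter are hypothetical, the graph is assumed to have no self-loops, and we assume that the traversal marks T(e) are reset at the beginning of each message propagation simulation, so that an edge can be rewarded at most once per simulation.

```python
import random
from collections import defaultdict


def erw_kpath(edges, k, rho=None, beta=None, seed=None):
    """Illustrative ERW-Kpath sketch on an undirected edge list (hypothetical API)."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    m = len(edges)
    rho = m - 1 if rho is None else rho          # default: rho = |E| - 1
    beta = 1.0 / m if beta is None else beta     # default: beta = 1 / |E|

    # I(v): indices of the edges incident on each node.
    incident = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        incident[u].append(idx)
        incident[v].append(idx)
    nodes = list(incident)

    omega = [1.0 / m] * m                        # default score of every edge
    for _ in range(rho):
        traversed = set()                        # assumption: T(e) is reset per simulation
        v = rng.choice(nodes)                    # source chosen uniformly at random
        for _ in range(k):                       # at most k hops (N < k)
            candidates = [i for i in incident[v] if i not in traversed]
            if not candidates:                   # every incident edge already traversed
                break
            i = rng.choice(candidates)           # edge chosen uniformly at random
            omega[i] += beta                     # award the bonus
            traversed.add(i)
            a, b = edges[i]
            v = b if v == a else a               # move to the node reached through the edge
    return dict(zip(edges, omega))
```

With the default values of rho and beta, every score returned by this sketch lies in the interval [1/|E|, 1], in agreement with the range discussed above.
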
3.4. Novelties introduced by our approach

In this section we discuss the main novelties introduced by our ERW-Kpath and WERW-Kpath algorithms.

First of all, we observe that our approach is flexible in the sense that it can be easily modified to incorporate
new models describing the spread of a message in a network. For instance, we can define multiple
strategies to select the source node from which each message propagation simulation starts. In particular,
in this paper we considered two options, namely: (i) the probability of selecting a node s as the source is
uniform across all the nodes in the network (this is the basis of the ERW-Kpath algorithm) or (ii)
the probability of selecting a node s as the source is proportional to the degree of s (this is the basis
of WERW-Kpath). It would be easy to select a different probability distribution, if necessary. In an
analogous fashion, in the ERW-Kpath and WERW-Kpath algorithms we defined two strategies to select the
node receiving a message; of course, other, more complex strategies could be implemented to
replace those described in this paper.

In addition, observe that the ERW-Kpath and WERW-Kpath algorithms provide a unicast propagation 
model in which any sender node is in charge of selecting exactly one receiving node. We could easily modify 
our algorithms in such a way as to support a multicast propagation model in which a node could issue a 
message to multiple receivers. 

A further novelty is that we use multiple random walks to simulate the propagation of messages and assume
that the frequency of selecting an edge e in these walks is a measure of its centrality. An approach similar
to ours was presented in [30], but it assumes that messages propagate along shortest paths. In detail, given
a pair of nodes i and j, the approach of [30] introduces a parameter, called network efficiency $\varepsilon_{ij}$, as the
inverse of the length of the shortest path(s) connecting i and j. After that, it provides a new parameter,
called information centrality: the information centrality $IC_e$ of an edge e is defined as the relative drop in the
network efficiency generated by the removal of e from the network. Our approach differs from that of [30]
in that a network is viewed as a decentralized system in which
no user has complete knowledge of the network topology. Due to this incomplete knowledge,
users are not able to identify shortest paths and, therefore, they use a probabilistic model to spread messages.
This also has relevant computational consequences: the identification of the shortest paths between all pairs of nodes in
a network is computationally expensive and could be unfeasible on networks containing millions of nodes.
By contrast, our approach scales almost linearly with the number of edges and, therefore, it can easily run
also over large networks.

Finally, although our approach relies on the concept of message propagation, which requires an orientation on
edges, it can work also on undirected networks. In fact, the ERW-Kpath (resp., WERW-Kpath) algorithm
selects at the beginning a source node s that decides the node v to which a message has to be forwarded.
Therefore, at run-time, the ERW-Kpath (resp., WERW-Kpath) algorithm induces an orientation on the
edge linking s and v which coincides with the direction of the message sent by s; such a process does not
require directed networks, even if it works naturally with this type of network.



3.5. Comparison of the ERW-Kpath and WERW-Kpath algorithms

In this section we provide a comparison between ERW-Kpath and WERW-Kpath. First of all, we would like
to observe that, according to Theorem 6.2, both algorithms are capable of correctly approximating
the K-path centrality values provided in Definition 3.

Although both algorithms are formally correct, we observe that the WERW-Kpath algorithm
should be preferred to ERW-Kpath. In fact, in the ERW-Kpath algorithm, we assume that each node can
select, at random, any edge (among those that have not yet been selected) to propagate a message. Such an
assumption could be, however, too strong in real-life social networks. To clarify this concept, consider
online social networks like Facebook or Twitter. In both of these networks a single user may have a large
number of contacts with whom she/he can exchange information (e.g., a wall post on Facebook or a tweet
on Twitter). However, sociological studies reveal that there is an upper limit to the number of people with
whom a user can maintain stable social relationships; this number is known as the Dunbar number [31].
For instance, in Facebook, the average number of friends of a user is 120. On the other hand, it has been
reported that male users actively communicate with only 10 of them, whereas female users with 16. This
implies that there are preferential edges along which information flows in social networks.

The ERW-Kpath algorithm is simple and easy to implement but it could fail to identify preferential edges
along which messages propagate. By contrast, in the WERW-Kpath algorithm, the probability of selecting
an edge is proportional to the weight already acquired by that edge. This weight, therefore, should be
interpreted as the frequency with which two nodes exchanged messages in the past.
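
The two weighted choices that distinguish WERW-Kpath from ERW-Kpath can be sketched in Python as follows. This is a minimal sketch that follows only the verbal description given in the text (source node chosen with probability proportional to its degree, edge chosen with probability proportional to its current weight); it does not reproduce the exact weighted functions of the paper, and the helper names pick_source and pick_edge, as well as the dictionary-based inputs, are hypothetical.

```python
import random


def pick_source(degree, rng):
    """Choose a source node with probability proportional to its degree."""
    nodes = list(degree)
    return rng.choices(nodes, weights=[degree[n] for n in nodes], k=1)[0]


def pick_edge(candidates, omega, rng):
    """Choose one not-yet-traversed edge with probability proportional to its weight."""
    return rng.choices(candidates, weights=[omega[e] for e in candidates], k=1)[0]
```

Since every edge starts with a strictly positive default score, the weights passed to pick_edge never sum to zero as long as at least one candidate edge is available.
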

Such a property also has a relevant implication and enables some applications which could not be
implemented by the ERW-Kpath algorithm. In fact, our approach can, to some extent, be exploited to
recommend/predict links in a social network. The problem of recommending/predicting links plays a key
role in Computer Science and Sociology and it is known in the literature as the link prediction problem
[32]. In the link prediction problem, the network topology is analyzed to find pairs of non-connected nodes
which could benefit from creating a social link. Various measures can be exploited to assess whether a link
should be recommended between a pair of nodes u and v; for instance, the simplest measure is to compute
the Jaccard coefficient J(u, v) on the neighbors of u and v. The larger the number of neighboring nodes
shared by u and v, the larger J(u, v); in such a case it is convenient to add an edge in the network linking u
and v. Further (and more complex) measures take the whole network topology into account to recommend
links. For instance, the Katz coefficient [32] considers the whole ensemble of paths running between u and
v to decide whether a link between them should be recommended.

The WERW-Kpath algorithm can be exploited to address the link prediction problem. In detail, by means of
WERW-Kpath, we can handle not only topological information but we can also quantify the strength of the
relationship joining two nodes. So, we know that two nodes u and v are connected and, in addition, we
also know how frequently they exchange information. This allows us to extend the measure introduced above:
for instance, if we would like to use the Jaccard coefficient, we can consider only those edges (called strong
edges) coming out from u (resp., v) whose weight is greater than a given threshold. This
is equivalent to filtering out all the edges which are rarely employed to spread information. As a consequence,
the Jaccard coefficient can be computed on strong edges only.
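
The restriction of the Jaccard coefficient to strong edges can be sketched as follows. This is an illustrative sketch rather than a method evaluated in the paper: the function names, the representation of edge weights as a dictionary keyed by node pairs, and the threshold values are all assumptions made for the example.

```python
def jaccard(a, b):
    """Classical Jaccard coefficient between two collections of neighbors."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def strong_neighbors(node, weights, threshold):
    """Neighbors of `node` reached through edges whose weight exceeds `threshold`."""
    result = set()
    for (u, v), w in weights.items():
        if w > threshold:
            if u == node:
                result.add(v)
            elif v == node:
                result.add(u)
    return result


# Hypothetical edge weights, e.g. as produced by a weighted random-walk simulation.
weights = {("u", "a"): 0.9, ("u", "b"): 0.1, ("v", "a"): 0.8, ("v", "c"): 0.7}
score = jaccard(strong_neighbors("u", weights, 0.5), strong_neighbors("v", weights, 0.5))
```

In this toy example the weak edge between u and b is filtered out, so the coefficient is computed on the neighbor sets {a} and {a, c} rather than on {a, b} and {a, c}.
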

Due to these reasons, in the following experiments we focused only on the WERW-Kpath algorithm. 
4. Experimentation 

Our experimentation has been conducted on different online social networks whose datasets are publicly available.
The adopted datasets are summarized in Table 1.

Dataset 1 depicts the voting system of Wikipedia for the elections of January 2008. Datasets 2 and 3
represent the Arxiv archives of papers in the fields of, respectively, High Energy Physics (Phenomenology)
and Condensed Matter Physics, as of April 2003. Dataset 4 represents a network of scientific citations among
papers belonging to the Arxiv High Energy Physics (Theory) field. Dataset 5 describes a small sample of the
Facebook network, representing its friendship graph. Finally, Dataset 6 depicts a fragment of the YouTube
social graph as of 2007.

4.1. Robustness

A quality required for a good random-walk based algorithm is the robustness of results. In fact, it is important
that obtained results are consistent among different iterations of the algorithm, if initial conditions are the
same. In order to verify that our WERW-Kpath produces reliable results, we performed a quantitative and
a qualitative analysis as follows.

Source of the Facebook figures reported in Section 3.5: http://www.economist.com/node/13176775?story_id=13176775

Arxiv (http://arxiv.org/) is an online archive for scientific preprints in the fields of Mathematics, Physics and Computer
Science, amongst others.

#   Network      No. Nodes   No. Edges   Directed   Type         Ref
1   Wiki-Vote        7,115     103,689   Yes        Elections    [33]
2   CA-HepPh        12,008     237,010   No         Co-authors   [33]
3   CA-CondMat      23,133     186,932   No         Co-authors   [33]
4   Cit-HepTh       27,770     352,807   Yes        Citations    […]
5   Facebook        63,731   1,545,684   Yes        Online SN
6   Youtube      1,138,499   4,945,382   No         Online SN    […]

Table 1: Datasets adopted in our experimentation.

In the quantitative analysis we are interested in checking whether the algorithm produces the same results
in different runs. In the qualitative analysis, instead, we study whether different values of k strongly affect
the ranking of edges.



4.1.1. Quantitative analysis of results

Our first experiment aims to verify that results are consistent over different iterations with the same
configuration. To highlight this aspect, we ran the WERW-Kpath algorithm several times
on the same dataset, with the same configuration.



Regarding $\rho$, in the experimentation we adopt $\rho = |E| - 1$, which is consistent with Theorem 6.2. According
to the previous choice, the bonus awarded is fixed to $\beta = \frac{1}{|E|}$. As for the maximum length of the K-paths,
we chose a value of $k = 20$.

Our quantitative analysis highlights that the distributions of values obtained over different runs are almost
completely overlapping on each of the datasets listed in Table 1.

In Figure 2 we graphically report the distribution of edge centrality values for the "Wiki-Vote" dataset.
Results are from four different runs of the algorithm on the same dataset with the same configuration. Data
are plotted using a semi-logarithmic scale in order to highlight the "high" part of the distribution, where
edges with high K-path edge centrality lie.

Similar results are obtained by performing the same test over each considered dataset, but they are not reported
due to space limitations. The robustness property is necessary but not sufficient to ensure the correctness
of our algorithm.

In fact, the quantitative evaluation we performed ensures that the distributions of centrality values produced by
WERW-Kpath are consistent over different runs of the algorithm, but it does not ensure that, for example, a given edge $e \in E$
has after Run 1 a centrality value which is the same as (or, at least, very similar to) its value after Run 2. In
other words, the centrality values that overlap in different distributions might not refer to the same
edges.

To investigate this aspect we analyze results from a qualitative perspective, as follows.



4.1.2. Qualitative analysis of results

Our random-walk-based approach ensures minimal fluctuations of the centrality values assigned to each edge
along different runs, if the configuration of each run is the same.

To verify this aspect, we calculate the similarity of the distributions obtained by running WERW-Kpath four
times on each dataset, using the same configuration, and compare the results by adopting different measures. For
this experiment, we considered different settings for the length of the exploited K-paths, i.e., k = 5, 10, 20,
in order to investigate also its impact.

Figure 2: Robustness test on "Wiki-Vote".



The first measure considered is a variant of the Jaccard coefficient, classically defined as

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \qquad (12)$$

where X and Y represent, in our case, a pair of compared distributions of K-path edge centrality values.

In order to define the Jaccard coefficient in our context we need to take into account the following
considerations. Let us consider two runs of our algorithm, say X and Y, and an edge e;
let us denote with $\omega_X(e)$ (resp., $\omega_Y(e)$) the centrality index of e in the run X (resp., Y). Intuitively, the
performance of our algorithm is "good" if $\omega_X(e)$ is close to $\omega_Y(e)$; however, a direct comparison of the two
values could make no sense because, for instance, the edge e could have the highest weight in both
runs but $\omega_X(e)$ may significantly differ from $\omega_Y(e)$. Therefore, we need to consider the normalized values
$\frac{\omega_X(e)}{\max_{e \in X} \omega(e)}$ and $\frac{\omega_Y(e)}{\max_{e \in Y} \omega(e)}$ and assume that the algorithm yields good results if these values are "close".

To make this definition more rigorous we can define

$$\Delta(e) = \left| \frac{\omega_X(e)}{\max_{e \in X} \omega(e)} - \frac{\omega_Y(e)}{\max_{e \in Y} \omega(e)} \right|$$

and we say that the algorithm produces good results if $\Delta(e)$ is smaller than a threshold $\epsilon$.

Now, in order to fix the value of $\epsilon$, let us consider the values achieved by $\Delta(e)$ for each $e \in E$. We can
provide an upper bound $\bar{\Delta}$ on $\Delta(e)$ by considering two extremal cases: (i) $\omega_X(e) = \max_{e \in X} \omega(e)$ and
$\omega_Y(e) = \min_{e \in Y} \omega(e)$ or, vice versa, (ii) $\omega_X(e) = \min_{e \in X} \omega(e)$ and $\omega_Y(e) = \max_{e \in Y} \omega(e)$. For the sake of
simplicity, assume that case (i) occurs; of course, the following considerations hold true also in case (ii).
In such a case we obtain

$$\bar{\Delta} = 1 - \frac{\min_{e \in Y} \omega(e)}{\max_{e \in Y} \omega(e)}.$$

As discussed in the following (see Figures 4 and 5), edge centralities are distributed according to a power law
and, therefore, the value of $\min_{e \in Y} \omega(e)$ is some orders of magnitude smaller than $\max_{e \in Y} \omega(e)$. Therefore,
the ratio of $\min_{e \in Y} \omega(e)$ to $\max_{e \in Y} \omega(e)$ tends to 0 and $\bar{\Delta}$ tends to 1.

According to these considerations, we computed how many times the condition $\Delta(e) < \tau \bar{\Delta}$ holds true,
where $0 < \tau < 1$ is a tolerance threshold. Since $\bar{\Delta} \simeq 1$, this amounts to counting how many times $\Delta(e) < \tau$.

Therefore, we can define the modified Jaccard coefficient as follows

$$J_\tau(X, Y) = \frac{\left| \left\{ e : \left| \frac{\omega_X(e)}{\max_{e \in X} \omega(e)} - \frac{\omega_Y(e)}{\max_{e \in Y} \omega(e)} \right| < \tau \right\} \right|}{|E|} \qquad (13)$$

In our tests we considered the tolerance values $\tau = 0.01, 0.05, 0.10$, which correspond to a maximum accepted
variation of 1%, 5% and 10% of the edge centrality value assigned to a given edge along different runs
with the same configuration.

A mean degree of similarity $J_\tau^{avg}$ is obtained by averaging over the $\binom{4}{2} = 6$ possible pairs of
distributions obtained by analyzing the four runs over the datasets discussed above.
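
The computation of the modified Jaccard coefficient and of its average over all pairs of runs can be sketched in Python as follows. This is a minimal sketch, assuming that each run is stored as a dictionary mapping every edge to its centrality value, that all runs cover the same set of edges, and that the count of edges within tolerance is normalized by the number of edges (which is consistent with the percentages reported in the results); the function names are hypothetical.

```python
from itertools import combinations


def modified_jaccard(x, y, tau):
    """Fraction of edges whose max-normalized scores in runs x and y differ by less than tau."""
    max_x, max_y = max(x.values()), max(y.values())
    close = sum(1 for e in x if abs(x[e] / max_x - y[e] / max_y) < tau)
    return close / len(x)


def average_similarity(runs, tau):
    """Average of modified_jaccard over all unordered pairs of runs (6 pairs for 4 runs)."""
    pairs = list(combinations(runs, 2))
    return sum(modified_jaccard(a, b, tau) for a, b in pairs) / len(pairs)


# Two toy runs: y is a rescaled copy of x except for edge "e3".
x = {"e1": 1.0, "e2": 0.5, "e3": 0.1}
y = {"e1": 2.0, "e2": 1.0, "e3": 0.6}
```

Note that the normalization by the maximum makes the comparison insensitive to a global rescaling of the scores: in the toy runs above, only edge e3 contributes a non-zero difference.
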

The second measure we consider is the Pearson correlation coefficient, adopted to evaluate the correlation of the
two obtained distributions. It is defined as

$$\rho_{X,Y} = \frac{cov(X, Y)}{\sqrt{var(X) \cdot var(Y)}} \qquad (14)$$

whose values lie in the interval [-1, +1], with the following interpretations:

• $\rho_{X,Y} > 0$: distributions are directly correlated, in particular:
  - $\rho_{X,Y} > 0.7$: strongly correlated;
  - $0.3 < \rho_{X,Y} < 0.7$: moderately correlated;
  - $0 < \rho_{X,Y} < 0.3$: weakly correlated;
• $\rho_{X,Y} = 0$: not correlated;
• $\rho_{X,Y} < 0$: inversely correlated.

Clearly, the higher $\rho_{X,Y}$, the better the WERW-KPath algorithm works. Observe that the $\rho_{X,Y}$ coefficient
tells us whether the two distributions X and Y are linearly related or not. Therefore, it could
happen that the WERW-KPath algorithm, in two different runs, generates two edge centrality distributions
X and Y such that Y = aX, where a is a positive real coefficient. In such a case, the $\rho_{X,Y}$ coefficient would be 1 but
we could not conclude that the algorithm works properly. In fact, the coefficient a could be very small (or,
in the opposite case, very large) and, therefore, the two distributions would significantly differ even if they
preserved the same edge rankings.

For this reason, we consider a third measure in order to compute the distance between the two distributions
X and Y. To do so, we adopt the Euclidean distance $L_2(X, Y)$ defined as

$$L_2(X, Y) = \sqrt{\sum_{i} (X_i - Y_i)^2} \qquad (15)$$

As emerges from the distributions shown in Figure 2, almost all the terms in Equation (15) are negligible
and, therefore, the final value of $L_2(X, Y)$ is dominated by the difference of the K-path centrality values
associated with the few top-ranked edges. To obtain the average distance between two points in distributions
X and Y in a given dataset, we simply divide $L_2(X, Y)$ by the number of edges in that dataset.
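
The reason why the Pearson correlation alone is not sufficient can be illustrated with a short Python sketch. The two functions below are plain implementations of the Pearson coefficient and of the Euclidean distance over two equally long lists of centrality values; the function names and the toy data are assumptions introduced only for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)   # the 1/n factors cancel out


def euclidean(x, y):
    """Euclidean (L2) distance between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy example: Y = 10 * X preserves the ranking but not the values.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0 * v for v in x]
correlation = pearson(x, y)       # equal to 1 up to rounding
distance = euclidean(x, y)        # large, although the correlation is perfect
average_distance = distance / len(x)
```

In this toy case the correlation is perfect while the distance is far from zero, which is exactly the situation that the third measure is meant to detect.
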

Intrinsic characteristics of the analyzed datasets do not influence the robustness of results. In fact, even
when considering datasets representing different social networks (e.g., collaboration networks, citation networks
and online communities), WERW-Kpath produces highly overlapping results over different runs.

Dataset      k     τ = 0.01   τ = 0.05   τ = 0.10   ρ_{X,Y}   L2(X,Y)        avg(L2(X,Y))
Wiki-Vote    5     43.52%     98.49%     99.91%     0.67      1.61 × 10^-2   1.55 × 10^-7
Wiki-Vote    10    61.13%     98.86%     99.98%     0.69      2.37 × 10^-2   2.28 × 10^-7
Wiki-Vote    20    70.68%     99.96%     99.98%     0.70      3.48 × 10^-2   3.35 × 10^-7
CA-HepPh     5     52.63%     96.11%     99.53%     0.92      1.18 × 10^-2   4.97 × 10^-8
CA-HepPh     10    70.45%     99.02%     99.88%     0.95      1.23 × 10^-2   5.18 × 10^-8
CA-HepPh     20    75.65%     99.51%     99.87%     0.96      2.90 × 10^-2   1.22 × 10^-7
CA-CondMat   5     22.23%     80.51%     96.98%     0.73      1.39 × 10^-2   7.43 × 10^-8
CA-CondMat   10    35.16%     93.72%     99.40%     0.79      2.18 × 10^-2   1.16 × 10^-7
CA-CondMat   20    35.63%     95.80%     99.44%     0.83      3.40 × 10^-2   1.81 × 10^-7
Cit-HepTh    5     47.62%     97.76%     99.78%     0.78      0.92 × 10^-2   2.60 × 10^-8
Cit-HepTh    10    60.61%     99.45%     99.93%     0.83      1.36 × 10^-2   3.85 × 10^-8
Cit-HepTh    20    63.68%     99.62%     99.93%     0.85      2.04 × 10^-2   5.78 × 10^-8
Facebook     5     56.98%     97.34%     99.36%     0.79      1.01 × 10^-2   5.11 × 10^-9
Facebook     10    56.85%     98.49%     99.76%     0.84      1.87 × 10^-2   1.20 × 10^-8
Facebook     20    68.58%     99.39%     99.90%     0.84      2.67 × 10^-2   1.72 × 10^-8
Youtube      5     11.74%     44.28%     72.41%     0.49      1.31 × 10^-3   2.64 × 10^-10
Youtube      10    13.18%     59.40%     84.91%     0.75      1.87 × 10^-3   3.78 × 10^-10
Youtube      20    27.92%     82.29%     96.17%     0.89      2.83 × 10^-3   5.72 × 10^-10

Table 2: Analysis by using similarity coefficient J_τ, correlation ρ_{X,Y} and Euclidean distance L2(X,Y).



Already with a low tolerance, such as $\tau = 0.01$ or $\tau = 0.05$, values of K-path edge centrality are largely
overlapping. Results improve with the length of the K-path adopted. By increasing the tolerance and/or the
length of the K-paths, the overlap becomes almost complete. The same considerations hold true with respect to the
Pearson correlation coefficient, which identifies moderate to strong correlations among the different distributions.

Finally, as for the Euclidean distance, we observe that the returned values are always small: in every case
the distance lies in the range $[10^{-3}, 10^{-2}]$ and the average distance lies in the range $[10^{-10}, 10^{-7}]$.

4.2. Performance

All the experiments have been carried out by using a standard Personal Computer equipped with an Intel i5
processor and 4 GB of RAM. The implementation of the WERW-Kpath algorithm adopted in the following
experiments, developed by using Java 1.6, has been released and its adoption is strongly encouraged.

As shown in Figure 3, the execution time of WERW-Kpath scales very well (i.e., almost linearly) with
the length of the K-paths and with the number of edges in the given network.

This means that this approach is feasible also for the analysis of large networks, making it possible to 
compute an efficient centrality measure for edges in all those cases in which it would be very difficult or even 
unfeasible, for the computational cost, to calculate the exact edge-betweenness [5]. 

The importance of this aspect is evident if we consider that there exist several Social Network Analysis tools
that implement different algorithms to compute centrality indices on network nodes/edges. Our measure
could be integrated in such tools (e.g., NodeXL, Pajek, NWB and so on), in order to allow social network
analysts to study the centrality of edges in (possibly even larger) social networks.



WERW-Kpath implementation: http://www.emilio.ferrara.name/werw-kpath/
NodeXL: http://nodexl.codeplex.com/
Pajek: http://pajek.imfm.si/doku.php?id=pajek
NWB: http://nwb.cns.iu.edu/

Figure 3: Execution time with respect to network size. (Plot title: "Execution Time"; plotted series: "Time [s]".)



4.3. Analysis of Edge Centrality Distributions

In this section we study the distribution of edge centrality values computed by the WERW-Kpath algorithm.
In detail, we present the results of two experiments.

In the first experiment we ran our algorithm four times for each value of k = 5, 10, 20. We
averaged the K-path centrality values over the runs and plotted the edge centrality distribution.
The results are reported in Figure 4 using a logarithmic scale: the x-axis reports the identifier of each edge of the
given network and the y-axis its corresponding value of K-path edge centrality.

The usage of a logarithmic scale highlights a power law distribution for the centrality values. In fact, when
the behavior in a log-log scale resembles a straight line, the distribution can be well approximated by
a power law function $f(x) \propto x^{-\alpha}$. As a result, for all the considered datasets, there are few edges with high
centrality values whereas a large fraction of edges presents low (or very low) centrality values. Such a result
can be explained by recalling that, at the beginning, our algorithm considers all the edges on an equal footing
and provides them with an initial score which is the same for all the edges. However, during the algorithm
execution, few edges (which are actually the most central edges in a social network) are
frequently selected and, therefore, their centrality index is frequently updated. By contrast, many edges are
seldom selected and, therefore, their centrality index is rarely increased. This process yields a power law
distribution in edge centrality values.
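
The straight-line criterion mentioned above can be checked numerically with a simple least-squares fit in log-log space, as in the following Python sketch. This is only a rough diagnostic, not the procedure adopted in the paper: the function name loglog_fit is hypothetical, the input is assumed to contain strictly positive values, and a linear fit on logarithms is known to be a coarse way of estimating a power law exponent.

```python
import math


def loglog_fit(xs, ys):
    """Least-squares line through (log x, log y); returns (slope, intercept).

    For data following y = c * x**(-alpha), the slope estimates -alpha
    and the intercept estimates log(c).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data drawn exactly from f(x) = x**(-2).
xs = [1.0, 2.0, 4.0, 8.0]
ys = [x ** -2 for x in xs]
slope, intercept = loglog_fit(xs, ys)
```

On the synthetic data the fitted slope is close to -2, i.e., the exponent of the generating function with the sign reversed.
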

In the second experiment, we studied how the value of k affects edge centrality. In detail, we
considered the datasets separately and repeated the experiments described above. Also for this experiment
we considered three different values for k, namely k = 5, 10, 20. The corresponding results are plotted in
Figure 5, where the probability P of finding an edge in the network which has a given value of centrality
is plotted as a function of the K-path centrality. Each plot adopts a log-log scale.

The analysis of this figure highlights three relevant facts: 

• The probability of finding edges in the network with the lowest K-path edge centrality values is smaller
than the probability of finding edges with relatively higher centrality values. This means that most of the edges are
exploited for message propagation by the random walks a number of times greater than zero.

• The power law distribution in edge centrality emerges for different values of k and for
different datasets. In other words, if we use different values of k the centrality indexes may change
(see below); however, as emerges from Figure 4, for each considered dataset the curves representing
K-path centrality values are straight and parallel lines, with the exception of the last part. This implies
that, for a fixed value of k, say k = 5, an edge e will have a particular centrality score. If k passes
from 5 to 10 and, then, from 10 to 20, the centrality of e will be increased by a constant factor. This
implies that the ordering of the edges remains unchanged and, therefore, the edge having the highest
centrality at k = 5 will continue to be the most central edge also when k = 10 and k = 20. This
highlights a nice feature of WERW-Kpath: potential uncertainties in the tuning of the parameter k
do not have a devastating impact on the process of identifying the highest ranked edges.

• The higher k, the higher the value of centrality indexes. This has an intuitive explanation. If k
increases, our algorithm manages longer paths to compute centrality values. Therefore, the chance
that an edge is selected multiple times increases too. Each time an edge is selected, our algorithm
awards it a bonus score (equal to $\beta = \frac{1}{|E|}$). As a consequence, the larger k, the higher the number of
times an edge with high centrality will be selected and, ultimately, the higher its final centrality index.

Such a consideration provides a practical criterion for tuning k. In fact, if we select high values of
k, we are able to better discriminate edges with high centrality from edges with low centrality. By
contrast, in presence of low values of k, edge centrality indexes tend to flatten in a small interval
and it is harder to distinguish high centrality edges from low centrality ones.

On the one hand, therefore, it would be desirable to fix k as high as possible. On the other hand, since the
complexity of our algorithm is $O(km)$, where m is the number of edges, large values of k negatively affect the
performance of our algorithm. A good trade-off (supported by the experiments shown in this section) is to fix k = 20.

5. Applications of our approach in Knowledge-Based systems 

In this section we detail some possible applications of our approach to rank edges in social networks in the 
area of Knowledge-Based systems (hereafter, KBS). 

In detail, we shall focus on three possible applications. The first is data clustering and we will show how our
approach can be employed in conjunction with a clustering algorithm with the aim of better organizing data
available in a KBS. The second is related to the Semantic Web and we will show how our approach can be
used to assess the strength of the semantic association between two objects and how this feature is useful to
improve the task of discovering new knowledge in a KBS. The third, finally, concerns better understanding
the relationships and the roles of users in virtual communities; in this case we show that our approach is useful
to elucidate relationships such as trust.

5.1. Data Clustering 

A central theme in KBS-related research is the design and implementation of effective data clustering
algorithms […]. In fact, if a KBS has to manage massive datasets (potentially split across multiple data
sources), clustering algorithms can be used to organize available data at different levels of abstraction. The
end user (either a human user or a software program) can focus only on the portion of data which is the
most relevant to her/him rather than exploring the whole data space managed by a KBS [12], […], [36]. If we
ideally assume that any data item managed by a KBS is mapped onto a point of a multidimensional space, the
task of clustering available data requires computing the mutual distance between any pair of data
points.

Such a task, however, is in many cases unfeasible. In fact, the computation of the distance can be
prohibitively time-consuming if the number of data points is very large. In addition, KBS often manage data
which are related to each other but, for this kind of data, the computation of a distance could make no sense:
think, for instance, of data on the health status of a person and her/his demographic data like age or gender.

Therefore, many authors suggest representing data as graphs such that each node represents a data point
and each edge specifies the type of relationship binding two nodes. The problem of clustering graphs has
been extensively studied in the past and several algorithms have been proposed. In particular, the graph
clustering problem is also known in the social network literature as the community detection problem [37].

One of the early algorithms to find communities in graphs/networks was proposed by Girvan and Newman
in 2002 [5]. Unfortunately, due to its high computational complexity, the Girvan-Newman algorithm cannot
be applied to very large and complex data repositories consisting of millions of information objects.

Our algorithm, instead, can be employed to rank edges in networks and to find communities. This is an
ongoing research effort and the first results are quite encouraging [38].
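
To give an idea of how an edge ranking could drive community detection, the following Python sketch removes edges in decreasing order of centrality until the network splits into more connected components. This is not the method of [38]: it is a Girvan-Newman-style heuristic assumed here purely for illustration, and the function names, the input format (a list of nodes and a dictionary mapping each edge to its centrality) and the stopping rule are all hypothetical.

```python
def components(nodes, edges):
    """Connected components of an undirected graph, computed by depth-first search."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            n = stack.pop()
            component.add(n)
            for w in adjacency[n]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result.append(component)
    return result


def split_by_centrality(nodes, centrality):
    """Remove the most central edges one by one until the number of components grows."""
    remaining = sorted(centrality, key=centrality.get, reverse=True)
    initial = len(components(nodes, remaining))
    while remaining:
        remaining.pop(0)                      # drop the current most central edge
        parts = components(nodes, remaining)
        if len(parts) > initial:
            return parts
    return components(nodes, remaining)
```

For example, on two triangles joined by a single bridge edge that has the highest centrality, the first removal already separates the two triangles.
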

Once a community finding algorithm is available we can design complex applications to effectively manage
data in a KBS. For instance, in [13] the authors focused on online social networks like Internet newsgroups
and chat rooms. They analyzed through semantic tools the text comments posted by users and this allowed
large online social networks to be mapped onto weighted graphs. The authors showed that the discovery of
the latent communities is a useful way to better understand patterns of interactions among users and how
opinions spread in the network.

Figure 5: Effect of different k = 5, 10, 20 on networks described in Table 1.

We then describe two use cases possibly benefiting from community detection algorithms. In the first case,
consider a social network in which users fill in a profile specifying their interests. A graph can be constructed
which records users (mapped onto nodes) and relationships among them (e.g., an edge between two nodes
may indicate that two users share at least one interest). Our algorithm, therefore, could identify groups of
users showing the same interests.

Therefore, given an arbitrary message (for instance a commercial advertisement) we could identify groups
of users interested in it and we could selectively send the message only to interested groups.

As an opposite application, we can consider the objects generated within a social media platform. These
objects could be, for instance, photos in a platform like Flickr or musical tracks in a platform like Last.fm.
We can map the space of user generated contents onto a graph and apply our community detection
algorithm to it. In this way we could design advanced query tools: in fact, once a user issues a query, a KBS may
retrieve not only the objects exactly labeled by the keywords composing the user query but also objects falling
in the same community as the retrieved objects. In this way, users could retrieve objects of their interest
even if they are not aware of their existence.

5.2. Semantic Web 

A further research scenario that can take advantage of our research work is represented by the Semantic 
Web. In detail, Semantic Web tools like RDF allow complex, real-life scenarios to be modeled by means 
of networks. In many cases these networks are called multi-relational networks (or semantic networks) 
because they consist of heterogeneous objects and many types of relationships can exist among them [39]. 

For instance, an RDF knowledge base in the e-learning domain [1^ could consist of students, instructors and 
learning materials in a University. In this case, the RDF knowledge base could be converted to a semantic 
network in which nodes are the players described above. Of course, an edge may link two students (for 
instance, if they are friends or if they are enrolled in the same BSc programme), a student and a learning 
object (if a student is interested in that learning object), an instructor and a learning material (if the 
instructor authored that learning material) and so on [41] . 

A relevant theme in Semantic Web is to assess the weight of the relationships binding two objects because 
this is beneficial to discover new knowledge. For instance, in the case of the e-learning example described 
above, if a student has downloaded multiple learning objects on the same topic, the weight of an edge 
linking the student and a learning material would reflect the relevance of that learning material to the 
student. Therefore, learning materials can be ranked on the basis of their relevance to the user and only the 
most relevant learning materials can be suggested to the user. 

An approach like ours, therefore, could have a relevant impact in this application scenario because we could 
find interesting associations among items by automatically computing the weight of the ties connecting them. 
To the best of our knowledge, there are few works on the computation of node centrality in semantic networks 
[55] but, recently, some authors have suggested extending concepts introduced in Social Network Analysis, like 
that of shortest path, to multi-relational networks [14]. 

Therefore, we plan to extend our approach to the context of semantic networks. Our aim is to use simple 
random walks in place of shortest paths to efficiently discover relevant associations between nodes in a 
semantic network and to experimentally compare the quality of the results produced by our approach 
against that achieved by approaches relying on shortest paths. 

5.3. Understanding user relationships in virtual communities 

A central theme in KBS research is represented by the extraction of patterns of interactions among humans 
in a virtual community and their analysis with the goal of understanding how humans influence each other. 

A relevant problem is represented by the classification of the relationships among humans on the basis of their 
intensity. For instance, in [15] the authors focus on the criminal justice domain and, in particular, on 
the identification of social ties playing a crucial role in the transmission of sensitive information. In [42], 
the author provides a belief propagation algorithm which exploits social ties among members of a criminal 
social network to identify criminals. Our approach resembles that of [15] because both of them are able to 
associate each edge in a network with a score indicating the strength of the association between the nodes 
linked by that edge. 

A special case occurs when we assume that the edge connecting two nodes specifies a trust relationship 
[43, 44]. In [43], the authors suggest propagating trust values along paths in the social network graph. In 
an analogous fashion, the approach of [44] uses paths in the social network graph to propagate trust values 
and infer trust relationships between pairs of unknown users. Finally, Reinforcement Learning techniques 
are applied to estimate to what extent an inferred trust relationship has to be considered as credible. Our 
approach is similar to those presented above because all of them rely on a diffusion model. In [43, 44], the 
main assumption is that trust is transitive, i.e., if a user x trusts a user y who, in her/his 
turn, trusts a user z, then we can assume that x trusts z too. In our approach, we exploit connections among 
nodes to propagate messages by using simple random walks of bounded length. There are, however, some 
relevant differences: in the approaches devoted to computing trust, paths of any length are, 
in principle, useful to compute trust values, even if the contribution brought in by long paths is considered 
less relevant than that of short paths. By contrast, in our approach, the length of a path is bounded by a 
fixed constant k. 



6. Conclusions 

In this paper we introduced an edge centrality measure for social networks called K-path edge centrality 
index. Its computation is feasible even on large scale networks by using the algorithm we 
provided. The algorithm performs multiple random walks on the social network graph; these walks are simple 
and their length is bounded by a constant k. We showed that the worst-case time complexity of our algorithm 
is O(km), where m is the number of edges in the social network graph. Finally, we discussed experimental 
results obtained by applying our method to different online social network datasets. 

We plan to extend our work in several directions. First of all, our centrality measure can be used to detect 
communities in large social networks. Such a task is currently unfeasible if we use classic measures like edge 
betweenness centrality. In fact, to the best of our knowledge, efficient algorithms do not currently exist 
that estimate the community structure of a large network based on global topological information and our 
strategy could fit this purpose well. We believe that our approach could be beneficial in the field of 
visualization of large social networks as well. In fact, the possibility of exploiting efficient network 
clustering techniques based on edge bundling has recently been put forward to improve the graphical representation of 
the hierarchical structure of social networks |3Sj. 

In addition, we plan to design an algorithm to estimate the strength of ties between two social network 
actors: for instance, in social networks like Facebook this is equivalent to estimate the friendship degree 
between a pair of users. 

Finally, we point out that some researchers have studied how to design parallel algorithms to compute centrality 
measures; for instance, [46] proposed a fast parallel algorithm to compute betweenness centrality. We 
believe that designing parallel algorithms to compute the K-path edge centrality is a new and interesting 
research opportunity. 

Appendix 

In this section we shall analyze the correctness of our ERW-KPath and WERW-KPath algorithms. In detail, 
we will study how the centrality indexes returned by these algorithms are related to the actual centrality 
values provided in Definition 3. 

To explain our results it is convenient to rewrite Equation Q in a more manageable fashion. First of all, 
let us consider an undirected graph $G = (V, E)$ and denote as $\phi_{sl}$ an arbitrary simple path in $G$ starting 
from a fixed source node $s$ and of length $l$ (i.e., the considered path contains $l$ edges). The graph $G$ can be 
unweighted as well as weighted. 

In the following, when it does not generate confusion, we will avoid subscripts to denote both nodes and 
edges and, therefore, we will denote a node as $v$ (rather than $v_n$) and an edge as $e$ (rather than $e_m$). 

Let us assume that the sequence of nodes forming $\phi_{sl}$ is $\phi_{sl} = \{s, u_1, \ldots, u_l\}$ (note that $s = u_0$); in 
addition, let us denote as $P(\phi_{sl})$ the probability of generating the path $\phi_{sl}$ by simulating a simple path of 
length $l$. In [l^ the authors show that, in case $G$ is unweighted, the value of $P(\phi_{sl})$ is as follows 

$$P(\phi_{sl}) = \prod_{j=1}^{l} \frac{1}{\left| O(u_{j-1}) - \{s, u_1, \ldots, u_{j-2}\} \right|} \tag{16}$$

Here $O(u_j)$ is the set of nodes adjacent to $u_j$ (i.e., a node $v$ belongs to $O(u_j)$ if there is an edge joining $u_j$ 
to $v$). 
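
The unweighted expression for $P(\phi_{sl})$ can be evaluated mechanically. The following Python fragment is a 
minimal sketch under stated assumptions: the graph is given as a dictionary mapping each node to the set of its 
adjacent nodes, the path is given as the list of its nodes $[s, u_1, \ldots, u_l]$, and both the function name 
`path_probability` and the toy graph are hypothetical. 

```python
from fractions import Fraction


def path_probability(adj, path):
    """Probability of generating `path` as a simple random path (unweighted case).

    adj  : dict mapping each node u to the set O(u) of its adjacent nodes
    path : list of nodes [s, u_1, ..., u_l] forming a simple path of l edges
    """
    prob = Fraction(1)
    for j in range(1, len(path)):
        current = path[j - 1]
        visited = set(path[: j - 1])         # {s, u_1, ..., u_{j-2}}
        candidates = adj[current] - visited  # O(u_{j-1}) minus the visited nodes
        if path[j] not in candidates:
            return Fraction(0)               # the path cannot be generated
        prob *= Fraction(1, len(candidates))
    return prob


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
p = path_probability(adj, ["a", "c", "d"])  # (1/2) * (1/2) = 1/4
```

Exact rational arithmetic is used so that the value can be compared directly with a hand computation. 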

In an analogous fashion, it is possible to consider the case of weighted graphs. In detail, let $W(u, v)$ be the 
weight of the edge going from the node $u$ to the node $v$; in such a case, following the considerations 
presented in [57], we can derive the following expression for $P(\phi_{sl})$ 

$$P(\phi_{sl}) = \prod_{j=1}^{l} \frac{W(u_{j-1}, u_j)}{\sum_{v \in O(u_{j-1}) - \{s, u_1, \ldots, u_{j-2}\}} W(u_{j-1}, v)} \tag{17}$$
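
The weighted case can be sketched in the same way. The fragment below assumes that, at each step, the next 
node is chosen with probability proportional to the weight of the connecting edge among the neighbours not yet 
visited; this reading, the function name `weighted_path_probability`, the integer weights and the toy graph are 
assumptions made for illustration only. 

```python
from fractions import Fraction


def weighted_path_probability(weights, path):
    """Probability of generating `path` as a simple random path (weighted case).

    weights : dict of dicts, weights[u][v] = W(u, v) for each neighbour v of u
              (integer weights are assumed so that the arithmetic stays exact)
    path    : list of nodes [s, u_1, ..., u_l]
    """
    prob = Fraction(1)
    for j in range(1, len(path)):
        current, nxt = path[j - 1], path[j]
        visited = set(path[: j - 1])  # {s, u_1, ..., u_{j-2}}
        candidates = {v: w for v, w in weights[current].items() if v not in visited}
        if nxt not in candidates:
            return Fraction(0)        # the path cannot be generated
        prob *= Fraction(candidates[nxt], sum(candidates.values()))
    return prob


# Hypothetical weighted toy graph (triangle a-b-c plus pendant node d).
weights = {
    "a": {"b": 1, "c": 3},
    "b": {"a": 1, "c": 1},
    "c": {"a": 3, "b": 1, "d": 2},
    "d": {"c": 2},
}
p_w = weighted_path_probability(weights, ["a", "c", "d"])  # (3/4) * (2/3) = 1/2

# With unit weights the value reduces to the unweighted one.
unit = {u: {v: 1 for v in nbrs} for u, nbrs in weights.items()}
p_u = weighted_path_probability(unit, ["a", "c", "d"])     # (1/2) * (1/2) = 1/4
```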

We are now able to rewrite the expression of the edge centrality index $L^k(e)$ in terms of $P(\phi_{sl})$. In detail, given 
an edge $e \in E$ and a path $\phi_{sl}$, we will use the notation $e \in \phi_{sl}$ if the edge $e$ belongs to the path $\phi_{sl}$; we can 
therefore define a variable $\chi(e \in \phi_{sl})$ as follows 

$$\chi(e \in \phi_{sl}) = \begin{cases} 1 & \text{if } e \in \phi_{sl} \\ 0 & \text{otherwise} \end{cases}$$

Due to these definitions, it is possible to show that the edge centrality of an edge $e$ can be rewritten as 
follows 

$$L^k(e) = \sum_{s \in V} \sum_{1 \le l \le k} \sum_{\phi_{sl}} P(\phi_{sl}) \cdot \chi(e \in \phi_{sl}) \tag{18}$$

The interpretation of Equation (18) is as follows. To compute the edge centrality of an edge $e$ we start by 
fixing an arbitrary source node $s$. We consider a simple path $\phi_{sl}$ starting from $s$ of length $l$. The path $\phi_{sl}$ 
contributes to the centrality of $e$ only if it contains $e$ itself. This is captured by the product 
$P(\phi_{sl}) \cdot \chi(e \in \phi_{sl})$ in Equation (18): in fact, if $e \in \phi_{sl}$, then $\chi(e \in \phi_{sl}) = 1$ by definition and, therefore, 
the contribution of $\phi_{sl}$ to $L^k(e)$ is equal to $P(\phi_{sl})$. By contrast, if $e \notin \phi_{sl}$, then $\chi(e \in \phi_{sl}) = 0$ and, 
therefore, the path $\phi_{sl}$ does not provide any contribution to the computation of $L^k(e)$. 

Due to Definition 3, in the computation of $L^k(e)$ we are interested in all the simple paths up to length $k$; this 
explains why, in Equation (18), we need a double sum over all the simple paths of length $l$ with $1 \le l \le k$. 
Moreover, Definition 3 requires considering all the nodes $s \in V$ as potential source nodes and this explains 
the third sum appearing in Equation (18). 

It is also interesting to observe that the term $\sum_{1 \le l \le k} \sum_{\phi_{sl}} P(\phi_{sl}) \cdot \chi(e \in \phi_{sl})$ can be linked to the probability 
of selecting an edge $e \in E$ under the assumption that a simple random path starts from a fixed vertex $s \in V$. 
This is expressed by the following theorem: 

Theorem 6.1. Let $G = (V, E)$ be a graph, $s \in V$ be a node in $G$ and $e \in E$ be an edge in $G$. The probability 
$P_{e,s}$ of selecting the edge $e$ by means of a simple random path starting from $s$ is 
$P_{e,s} = \sum_{1 \le l \le k} \sum_{\phi_{sl}} P(\phi_{sl}) \cdot \chi(e \in \phi_{sl})$. 

To prove this result, let us focus on a vertex $s \in V$ and on an edge $e \in E$. We can consider three cases: 

Case 1. There is no simple path of length $l \le k$ starting from $s$ and containing $e$. In such a case, the term 
$\chi(e \in \phi_{sl})$ will always be 0 and, therefore, the value of $P_{e,s}$ will be 0. Such a result is correct because, in 
this case, the probability of selecting $e$ is clearly 0. 

Case 2. There exists exactly one path $\phi^*_{sl}$ containing the edge $e$; if this path is selected, then the edge $e$ will 
be selected too and the term $\chi(e \in \phi^*_{sl})$ will be equal to 1. The probability of selecting the edge $e$ will 
be, therefore, equal to the probability of selecting the path $\phi^*_{sl}$ passing through $e$. In such a case the term 
$P_{e,s}$ would simply be equal to $P_{e,s} = P(\phi^*_{sl})$, which coincides with the probability of selecting $e$. 

Case 3. There are multiple paths starting from $s$ and going through $e$. In such a case, the probability of 
selecting $e$ is equal to the probability of selecting at least one of these paths. Since the paths are generated 
one by one, the selections of distinct paths are mutually exclusive events and their probabilities add up; the 
probability $P_{e,s}$ of selecting $e$ is therefore equal to $\sum_{1 \le l \le k} \sum_{\phi_{sl}} P(\phi_{sl}) \cdot \chi(e \in \phi_{sl})$. 
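
On small graphs, the quantity $P_{e,s}$ of Theorem 6.1 and the index $L^k(e)$ can be computed exactly by enumerating 
all the simple paths of at most $k$ edges. The following Python fragment is a sketch of such a brute-force 
evaluation for the unweighted case; the function names `selection_mass` and `k_path_edge_centrality` and the toy 
graph are hypothetical, and the enumeration is exponential in $k$, so it is meant only as a check on tiny inputs. 

```python
from fractions import Fraction


def selection_mass(adj, s, edge, k):
    """Sum, over 1 <= l <= k and over simple paths phi_sl, of P(phi_sl) * chi(edge in phi_sl)."""
    target = frozenset(edge)
    total = Fraction(0)

    def extend(path, prob, contains_edge):
        nonlocal total
        # Neighbours of the last node, minus the nodes visited before it.
        candidates = adj[path[-1]] - set(path[:-1])
        for v in candidates:
            new_prob = prob / len(candidates)
            new_contains = contains_edge or frozenset((path[-1], v)) == target
            if new_contains:
                total += new_prob       # the extended path contains the edge
            if len(path) < k:           # the extended path has len(path) edges
                extend(path + [v], new_prob, new_contains)

    extend([s], Fraction(1), False)
    return total


def k_path_edge_centrality(adj, edge, k):
    """Outer sum over all the source nodes s in V."""
    return sum(selection_mass(adj, s, edge, k) for s in adj)


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
L = k_path_edge_centrality(adj, ("c", "d"), k=2)  # 1/4 + 1/4 + 1/3 + 2 = 17/6
```

Since the sum runs over every length $l$ between 1 and $k$, a path and each of its longer extensions contribute 
separately to the result. 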

Once we have provided a formal definition of edge centrality, we are interested in analyzing the centrality value 
generated by our algorithm. Let us focus on an edge $e$ and observe that our algorithm performs $\rho$ trials 
and, in each trial, it generates a simple random path of at most $k$ edges. Let us consider the $i$-th trial and 
observe that the edge $e$ can be selected in the $i$-th trial or not; of course, since the path must be simple, the 
edge $e$ can be selected no more than once in a trial. To model the selection of an edge $e$ in the generic $i$-th 
trial, we define the random variable $X_i(e)$ as follows 

$$X_i(e) = \begin{cases} 1 & \text{if } e \text{ has been selected in the } i\text{-th trial} \\ 0 & \text{otherwise} \end{cases}$$



Recall that our ERW-KPath algorithm (along with its weighted version WERW-KPath) initially awards any 
edge by assigning it a centrality index equal to $\frac{1}{|E|}$. Any time an edge $e$ is selected, it gets an additional 
award equal to $\beta = \frac{1}{|E|}$; as a consequence, since the number of times the edge $e$ is selected is equal to 
$\sum_{i=1}^{\rho} X_i(e)$, the value $\omega(e)$ returned by the algorithm is equal to 

$$\omega(e) = \sum_{i=1}^{\rho} \frac{X_i(e)}{|E|} + \frac{1}{|E|} \tag{19}$$
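
The trial procedure analyzed here can be simulated directly. The fragment below is a hedged sketch and not 
the pseudocode of ERW-KPath: it assumes that the source node is chosen uniformly at random, that the next node 
is chosen uniformly among the neighbours not already on the path, and that a trial stops after $k$ edges or at a 
dead end. The function name `simulate_omega` and the `seed` parameter are hypothetical. 

```python
import random
from fractions import Fraction


def simulate_omega(adj, k, rho, seed=0):
    """Return omega(e) for every edge after rho trials: 1/|E| + (number of selections)/|E|."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    hits = {e: 0 for e in edges}              # hits[e] = sum over i of X_i(e)

    for _ in range(rho):
        path = [rng.choice(nodes)]            # source node chosen uniformly at random
        while len(path) - 1 < k:              # a path contains at most k edges
            candidates = sorted(adj[path[-1]] - set(path))
            if not candidates:
                break                         # dead end: no unvisited neighbour
            nxt = rng.choice(candidates)
            hits[frozenset((path[-1], nxt))] += 1   # X_i(e) = 1 for this edge
            path.append(nxt)

    m = len(edges)
    return {e: Fraction(1, m) + Fraction(hits[e], m) for e in edges}


# Hypothetical single-edge graph: the only edge is selected in every trial,
# so omega = 1/1 + 10/1 = 11.
single = {"a": {"b"}, "b": {"a"}}
omega = simulate_omega(single, k=3, rho=10)
```

Because every generated path is simple, each edge is counted at most once per trial, as required by the 
definition of $X_i(e)$. 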



Our goal is to show that the ERW-KPath and WERW-KPath algorithms provide a "good" approximation 
of $L^k(e)$. This is formalized by Theorem 6.2. 

Theorem 6.2. Let $G = (V, E)$ be a graph and, for each edge $e \in E$, let $L^k(e)$ be the K-path edge centrality 
index of $e$ computed according to Definition 3. Finally, let $\rho$ be an integer (the number of trials performed 
by the algorithm). The following results hold true: 

1. The edge centrality value $\omega(e)$ computed by the ERW-KPath algorithm on $G$ is related to the actual 
centrality value $L^k(e)$ by the following relation: $\omega(e) = \frac{1}{|E|} + \frac{\rho}{|E||V|} L^k(e)$. 

2. The edge centrality value $\omega(e)$ computed by the WERW-KPath algorithm on $G$ is related to the actual 
centrality value $L^k(e)$ by the following relation: $\underline{c}\, L^k(e) + \frac{1}{|E|} \le \omega(e) \le \overline{c}\, L^k(e) + \frac{1}{|E|}$, being $\underline{c}$ and $\overline{c}$ 
two suitable constants whose value is proportional to the ratio $\frac{\rho}{|E|}$. 

Proof. We shall consider two cases, depending on whether we apply the ERW-KPath or the 
WERW-KPath algorithm. 

Case 1: ERW-KPath. Let us compute the expectation of both the members of Equation (19). Due to the 
linearity of the expectation operator we get 

$$\mathbb{E}[\omega(e)] = \sum_{i=1}^{\rho} \frac{\mathbb{E}[X_i(e)]}{|E|} + \frac{1}{|E|}$$

Observe now that, since $\omega(e)$ is a fixed value computed by our algorithm, then $\mathbb{E}[\omega(e)] = \omega(e)$. As for 
$\mathbb{E}[X_i(e)]$, it is simply equal to $P(X_i(e) = 1)$ due to the definition of expectation 

$$\mathbb{E}[X_i(e)] = 0 \cdot P(X_i(e) = 0) + 1 \cdot P(X_i(e) = 1) = P(X_i(e) = 1)$$

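Observe that $P(X_i(e) = 1)$ is the probability of selecting the edge $e$. Observe that the ERW-KPath algorithm 
manages an overall number of source nodes $s$ equal to $|V|$ and that each node is selected uniformly at random, 
i.e., with probability $\frac{1}{|V|}$. In addition, due to Theorem 6.1, once $s$ has been fixed the probability of selecting $e$ 
starting from $s$ is equal to $\sum_{1 \le l \le k} \sum_{\phi_{sl}} P(\phi_{sl}) \cdot \chi(e \in \phi_{sl})$; the probability $P(\phi_{sl})$, in the case of the 
ERW-KPath algorithm, has to be intended as in Equation (16). 

Due to these reasons, we get that 

$$P(X_i(e) = 1) = \sum_{s \in V} \frac{1}{|V|} \sum_{1 \le l \le k} \sum_{\phi_{sl}} P(\phi_{sl}) \cdot \chi(e \in \phi_{sl})$$

which can be rewritten as 

$$P(X_i(e) = 1) = \frac{1}{|V|} L^k(e)$$

Due to this result we can write 

$$\omega(e) = \sum_{i=1}^{\rho} \frac{1}{|E||V|} L^k(e) + \frac{1}{|E|}$$

After some simplifications we get 

$$\omega(e) = \frac{1}{|E|} + \frac{\rho}{|E||V|} L^k(e)$$

which states that the value computed by our algorithm differs from the actual value of $L^k(e)$ only by a constant 
factor $\frac{\rho}{|E||V|}$ and a constant additive term $\frac{1}{|E|}$. 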
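
Since the relation between $\omega(e)$ and $L^k(e)$ derived in Case 1 is linear, it can be inverted to read an 
estimate of $L^k(e)$ off the score returned by the algorithm. The following Python fragment is a small sketch of 
both directions; the function names and the numerical values are hypothetical toy choices. 

```python
from fractions import Fraction


def omega_from_centrality(L, n_nodes, n_edges, rho):
    """Forward relation of Case 1: omega = 1/|E| + rho / (|E| |V|) * L."""
    return Fraction(1, n_edges) + Fraction(rho, n_edges * n_nodes) * Fraction(L)


def centrality_from_omega(omega, n_nodes, n_edges, rho):
    """Inverse relation: L = (omega - 1/|E|) * |E| |V| / rho."""
    return (Fraction(omega) - Fraction(1, n_edges)) * n_edges * n_nodes / rho


# Hypothetical toy values: |V| = 4, |E| = 4, rho = 3 and L = 17/6.
w = omega_from_centrality(Fraction(17, 6), n_nodes=4, n_edges=4, rho=3)  # 25/32
L_back = centrality_from_omega(w, n_nodes=4, n_edges=4, rho=3)           # 17/6
```

Because the mapping is increasing in $L^k(e)$, ranking the edges by $\omega(e)$ or by the recovered value gives the 
same order. 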

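Case 2: WERW-KPath. The proof in this case is analogous to Case 1. In detail, by repeating the 
considerations provided in Case 1, we can show that 

$$\omega(e) = \sum_{i=1}^{\rho} \frac{P(X_i(e) = 1)}{|E|} + \frac{1}{|E|}$$

In such a case, however, the expression for $P(X_i(e) = 1)$ is slightly more complex than in Case 1. In detail, 
in the WERW-KPath algorithm the source node $s$ is selected with probability $P(s)$ provided in Equation 
(11). Therefore, we get 

$$P(X_i(e) = 1) = \sum_{s \in V} P(s) \sum_{1 \le l \le k} \sum_{\phi_{sl}} P(\phi_{sl}) \cdot \chi(e \in \phi_{sl})$$

and the term $P(\phi_{sl})$ is now computed according to Equation (17) because the graph $G$ is now weighted. 
Set $P = \max_{s \in V} P(s)$ and $p = \min_{s \in V} P(s)$; we get the following bounds 

$$p \sum_{s \in V} \sum_{1 \le l \le k} \sum_{\phi_{sl}} P(\phi_{sl}) \cdot \chi(e \in \phi_{sl}) \le P(X_i(e) = 1) \le P \sum_{s \in V} \sum_{1 \le l \le k} \sum_{\phi_{sl}} P(\phi_{sl}) \cdot \chi(e \in \phi_{sl})$$

The last equation can be rewritten as 

$$p L^k(e) \le P(X_i(e) = 1) \le P L^k(e)$$

and, by summing over all the indexes $i = 1, \ldots, \rho$, 

$$\rho p L^k(e) \le \sum_{i=1}^{\rho} P(X_i(e) = 1) \le \rho P L^k(e)$$

and, by Equation (19), we obtain 

$$\frac{\rho p}{|E|} L^k(e) + \frac{1}{|E|} \le \omega(e) \le \frac{\rho P}{|E|} L^k(e) + \frac{1}{|E|}$$

By setting $\underline{c} = \frac{\rho p}{|E|}$ and $\overline{c} = \frac{\rho P}{|E|}$ we obtain the relation stated in the theorem, 
which ends the proof. 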